Abstract

Stepwise methods are frequently employed in educational and 
psychological research, both to select useful subsets of variables 
and to evaluate the order of importance of variables. Three 
problems with stepwise applications are explored in some detail. 
First, computer packages use incorrect degrees of freedom in their 
stepwise computations, resulting in artifactually greater
likelihood of obtaining spurious statistical significance. Second, 
stepwise methods do not correctly identify the best variable set of 
a given size, as illustrated by a concrete heuristic example. 
Third, stepwise methods tend to capitalize on sampling error, and 
thus tend to yield results that are not replicable. 



It is the practice within Educational and Psychological 
Measurement and other journals to present occasional supplementary 
guidelines for authors that complement general APA style 
requirements. For example, Thompson (1994b) discussed requirements 
involving both statistical significance testing and language 
regarding score reliability. The present paper focuses on major 
problems with stepwise analyses, and suggests that these methods 
ought to be avoided in favor of more suitable alternatives. 

Huberty (1994) recently noted that, "It is quite common to 
find the use of 'stepwise analyses' reported in empirically based 
journal articles" (p. 261). However, various authors have
presented scathing indictments of many of these applications (cf.
Huberty, 1989; Snyder, 1991; Thompson, 1989). Three major problems 
can be noted. 

The heuristic examples employed here to illustrate these three
problems involve stepwise regression analysis. However, since all
commonly applied analytic methods are correlational (Cohen, 1968),
and are special cases of canonical correlation analysis (Knapp, 
1978; Thompson, 1991), the present discussion generalizes across 
the full family of these various applications. 

Some researchers employ stepwise methods to select a subset of 
better variables from among a larger constellation of predictors, 
for use in present or future research (i.e., so-called "variable
selection"). The methods are also sometimes used to interpret data
dynamics, under a premise that selected variables are more
important than predictors that are not selected, or that entry
order reflects variable importance (i.e., so-called "variable
ordering"). Stepwise methods are not usually useful for either
purpose.

Horrendously Wrong Degrees of Freedom

Problem 

Degrees of freedom in statistical analyses reflect the number 
of unique pieces of information present for a given research 
situation. These degrees of freedom constrain the number of 
inquiries we may direct at our data, and are the currency we spend 
in analysis. 

Regrettably, commonly used statistical packages incorrectly 
compute the degrees of freedom in stepwise analyses. The use of 
incorrect degrees of freedom in practice often has dire 
consequences as regards the accuracy of our inferences. 

Table 1 presents an illustration. Presume that we have data 
from 101 subjects on a dependent variable ("Y") and 50 predictor 
variables. After five steps of stepwise regression analysis, the 
five entered predictor variables may "explain" 20% of the
variability in the Y scores (i.e., 20/100 = 20% = R^2), as
illustrated in Table 1.

INSERT TABLE 1 ABOUT HERE. 

Computer packages compute the total degrees of freedom correctly,
as n - 1. However, the degrees of freedom "explained" (also variously
called "model", "regression", "between", etc.) is computed as the
number of "entered" predictor variables (i.e., PV). The degrees of
freedom "unexplained" (also variously called "error", "residual",
"within", etc.) is then computed as n - 1 - PV. These calculations
yield a statistically significant (α = .05) result in the Table 1
illustration.

However, various researchers (cf. Snyder, 1991) have correctly
noted that these degrees of freedom calculations for the explained 
and unexplained variance partitions are simply wrong. If the five 
entered predictor variables had been randomly selected, an 
explained degrees of freedom of 5 might be arguably correct. 

But our five predictors were selected by, at each step, 
looking at the results for all the predictor variables not yet 
entered! Viewed differently, at each step all 50 predictor
variables were entered, though we may have constrained the b and β
weights for most of the predictors to be 0 at each step (Cliff,
1987, p. 187). Thus, the computer packages are erroneously not
charging us any degrees of freedom for consulting our data in this
manner.

This statistical welfare system may cause us to radically
overestimate the atypicality of our results, i.e., create an
artifactually small p calculated. Table 1 dramatically illustrates how
the use of the incorrect degrees of freedom can (a) radically
inflate MS explained, (b) radically deflate MS unexplained, and
consequently (c) very radically inflate F calculated (e.g., 4.75
versus 0.25). No wonder Cliff (1987, p. 185) noted that "most
computer programs for [stepwise] multiple regression are positively
satanic in their temptations toward Type I errors."
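
The arithmetic behind Table 1 can be reproduced in a few lines. The Python sketch below is offered only as an illustration; the function and variable names are hypothetical conveniences and do not correspond to any statistical package. It contrasts the degrees of freedom conventionally charged (the 5 entered predictors) with the allocation argued for here (all 50 candidate predictors that were consulted).

```python
def f_ratio(ss_explained, ss_unexplained, df_explained, df_unexplained):
    """Return (MS explained, MS unexplained, F calculated)."""
    ms_explained = ss_explained / df_explained
    ms_unexplained = ss_unexplained / df_unexplained
    return ms_explained, ms_unexplained, ms_explained / ms_unexplained


# Values taken from the Table 1 illustration.
n = 101              # subjects
n_candidates = 50    # predictors available to the stepwise search
n_entered = 5        # predictors entered after five steps
ss_total = 100.0
ss_explained = 20.0
ss_unexplained = ss_total - ss_explained
df_total = n - 1

# Allocation used by the packages: charge only for entered predictors.
packaged = f_ratio(ss_explained, ss_unexplained,
                   n_entered, df_total - n_entered)

# Allocation argued for in the text: charge for every predictor consulted.
corrected = f_ratio(ss_explained, ss_unexplained,
                    n_candidates, df_total - n_candidates)

for label, (ms_e, ms_u, f_calc) in (("packaged", packaged),
                                    ("corrected", corrected)):
    print(f"{label:9s} MS explained={ms_e:.4f} "
          f"MS unexplained={ms_u:.4f} F={f_calc:.2f}")
```

The same sums of squares yield an F calculated of 4.75 under the first allocation and 0.25 under the second, which are the two values reported in Table 1.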
Caveats

Of course, it is important in evaluating statistical practices 
not to make what in logic is termed an "is/ought" or a
"should/would" error (Hudson, 1969; Hume, 1957). As Strike (1979) 
explains, 

To deduce a proposition with an "ought" in it from 
premises containing only "is" assertions is to get 
something in the conclusion not contained in the 
premises, something impossible in a valid deductive 
argument. (p. 13)

The fact that most researchers "are" using the wrong degrees of 
freedom in their stepwise analyses does not mean that we therefore 
"should" abandon these methods. Instead, logically we ought simply 
to use the correct degrees of freedom. 

We need not even somehow persuade the software companies to
fix their computer programs; we need only use the printed
sums-of-squares with the correct degrees of freedom we derive
ourselves to recalculate the remaining statistical tests.
Doing so merely requires a willingness to believe that computer 
programs are not infallible, because computer programs were written 
by fallible people and not by higher beings. 

It is important to note that not all stepwise applications are
equally evil as regards the inflation of Type I error. For
example, the stepwise results after one step for a problem
involving only two predictors might not be so seriously distorted.
Some readers may protest that no one would ever invoke stepwise
methods with a small number of predictor variables. However, a
colleague only a few days ago described a manuscript for which he 
was serving as a referee, and in that study submitted to a 
prominent national journal the authors conducted several dozen 
stepwise methods for problems each involving only three predictor 
variables! 

The seriousness of problems with wrong degrees of freedom 
being used, as with most statistical (and life) issues, is 
situationally conditional. Stepwise methods will be somewhat less 
evil, for example, when (a) the sample size is very large, (b) the 
number of predictor variables is small, and/or (c) the sum of 
squares explained remains near zero across steps. 

Does Not Identify the Best Predictor Set of Size "q"

Problem

Unfortunately, many researchers erroneously believe that 
conducting two or five steps of analysis will identify the best 
predictor set of size two or five. This simply is not what stepwise 
methods typically do. 

Ignoring for present purposes the variable deletion aspect of 
a true stepwise analysis, at step number five forward stepwise 
methods address the question, "Given the four predictors already 
entered, which one additional predictor will most improve the 
analysis?". Thus, the question is conditioned on the presence of 
the first four predictors, and yields a situation-specific 
conditional answer in the context (a) only of the specific
variables already entered and (b) only of those variables used in the
particular study but not yet entered.

If the first variable entered were different, the variables
entered in the remaining steps might also differ. Furthermore, even if
the first four entered variables remained constant, deleting or 
adding predictors from the study certainly might also yield a 
different answer to the context-specific stepwise question. 

But if we wish to determine the best set of predictor
variables of size q, the question, "what is the best set of q=5
predictors?", does not ask a conditional question invoking a linear
sequence of variable entry. Of course, if we desire this second
question to be answered, it is not reasonable to invoke the answer
to a question one is not posing!

Thus, the five predictors entered in five steps of forward
entry will not typically answer the question as to what are the
best q=5 predictors, and it is even conceivable that none of the
five variables selected by stepwise will be included in the best
subset of five predictors.

Figure 1 presents the Venn diagram of a heuristic example to
make this dynamic concrete. Since Venn diagrams are two-dimensional
representations of multi-dimensional phenomena, they must be
interpreted as only figurative portrayals of simultaneous
relationships among three or more variables (Craeger, 1969).
However, bivariate relationships can be literally presented in this 
manner. 

INSERT FIGURE 1 ABOUT HERE. 

The example involves a dependent variable, Y, and four
predictor variables. Table 2 presents sums-of-squares variance
partitions associated with Figure 1, e.g., X1 explains 100 of the
400 sums-of-squares units associated with the individual
differences (i.e., variability) in the Y scores. Table 3 translates
the sums of squares into correlation coefficients.

INSERT TABLES 2 AND 3 ABOUT HERE. 

Table 4 presents the regression analyses for the data. If a
stepwise analysis were conducted, predictor X1 would be entered
first, because this variable has the largest squared bivariate
correlation (r^2 = 25%) with Y. In the second step, predictor X2
would be entered, and the resulting R^2 would be 45.00%.

INSERT TABLE 4 ABOUT HERE. 

However, if an all-possible-subsets analysis is conducted with
the same data, the best predictor set of size q=2 is determined to
be predictors X3 and X4, with an R^2 of 47.5%. The best predictor set
of size q=2 does not include either of the two predictors entered
in the two steps of the stepwise analysis!
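
The contrast between the two questions can be checked directly from the Table 3 values. The following Python sketch is illustrative only: the function names are hypothetical, the two-predictor R^2 is computed with the formula given in the note to Table 4, and all correlations are taken as positive, as they appear in Table 3.

```python
from itertools import combinations
from math import sqrt

SS_Y = 400.0  # total sums-of-squares for Y in the heuristic example

# Common SOS between each predictor and Y (Table 3).
COMMON_SS_WITH_Y = {1: 100.0, 2: 80.0, 3: 91.0, 4: 99.0}

# Common SOS between predictor pairs (Table 3).
COMMON_SS_BETWEEN = {
    (1, 2): 0.0, (1, 3): 30.0, (1, 4): 60.0,
    (2, 3): 185.0, (2, 4): 3.0, (3, 4): 0.0,
}


def r_y(j):
    """Correlation of predictor j with Y."""
    return sqrt(COMMON_SS_WITH_Y[j] / SS_Y)


def r2_pair(i, j):
    """R^2 for the two-predictor model containing predictors i and j."""
    r_i, r_j = r_y(i), r_y(j)
    r_xx = sqrt(COMMON_SS_BETWEEN[tuple(sorted((i, j)))] / SS_Y)
    beta_i = (r_i - r_j * r_xx) / (1 - r_xx ** 2)
    beta_j = (r_j - r_i * r_xx) / (1 - r_xx ** 2)
    return beta_i * r_i + beta_j * r_j


def forward_two_steps():
    """Two steps of forward entry: best single, then best addition."""
    first = max(COMMON_SS_WITH_Y, key=lambda j: r_y(j) ** 2)
    rest = [j for j in COMMON_SS_WITH_Y if j != first]
    second = max(rest, key=lambda j: r2_pair(first, j))
    return (first, second), r2_pair(first, second)


def best_pair():
    """All-possible-subsets search over the six predictor pairs."""
    best = max(combinations(sorted(COMMON_SS_WITH_Y), 2),
               key=lambda pair: r2_pair(*pair))
    return best, r2_pair(*best)


print("forward entry:", forward_two_steps())
print("best subset  :", best_pair())
```

Forward entry selects X1 and then X2 (R^2 = .45), whereas the exhaustive search over the six pairs selects X3 and X4 (R^2 = .475), in agreement with Table 4.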
Caveats 

Again, few behaviors either in life or in statistics are 
always wrong. Some behaviors are only usually wrong, and we have 
to think about whether special exceptions have arisen. This is 
what makes teaching methodology so difficult — we must teach our 
students to think rather than only to memorize universal principles 
of lock-step rote behaviors. 

First, our two questions ("which one additional predictor...?"
and "what is the best set...?") are logically equivalent when we
are investigating the subset, q=1. Stepwise analysis does correctly
identify the best single predictor.

Second, the two types of analyses do yield the same answers 
whenever the predictors are perfectly uncorrelated. This occurs 
when we use orthogonally-rotated principal components scores in an 
analysis, for example. Of course, 30 steps of stepwise with such
predictors tells us nothing we don't already know, if we already
know the 30 correlation coefficients involving Y and each of the 30
uncorrelated component scores.

Tendency to Yield Non-replicable Results 

Problem 

Stepwise methods tend to yield conclusions that will not 
replicate in future research. This is because stepwise methods tend 
to capitalize outrageously on sampling error. Sampling error is 
variability in sample data that is unique to the given sample, and 
therefore cannot be reproduced in subsequent samples. Snyder 
(1991) presents an excellent heuristic example of these dynamics. 

At a given step, the determination of which single variable to
enter will enter variable X1 over variables X2, X3, and X4, even if
X1 is only infinitesimally superior to the other three variables.
It is entirely possible that this infinitesimal advantage of
variable X1 over another variable is sampling error, given that the
competitive advantage of X1 is so small.

Stepwise analysis is a linear series of conditional decisions,
not unlike the choices one makes in working through a maze. An
early mistake in the sequence will corrupt the remaining choices.
If X1 is incorrectly entered first in the analysis due to an
infinitesimal advantage representing only a small amount of 
sampling error, all remaining conditional entry decisions may also 
therefore be incorrect. 
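
The maze analogy can be made concrete with the heuristic data. In the Python sketch below, the function names are hypothetical, and the "perturbed" data set is an assumption introduced purely for illustration: it changes the sums-of-squares that X4 shares with Y from the 99 units reported in Table 3 to 101 units (a shift of half of one percent of the Y variance), leaving every other value untouched.

```python
from math import sqrt

SS_Y = 400.0  # total sums-of-squares for Y in the heuristic example

# Common SOS between predictor pairs (Table 3); held fixed in both runs.
COMMON_SS_BETWEEN = {
    (1, 2): 0.0, (1, 3): 30.0, (1, 4): 60.0,
    (2, 3): 185.0, (2, 4): 3.0, (3, 4): 0.0,
}


def r2_pair(ss_with_y, i, j):
    """R^2 for predictors i and j, using the formula in the Table 4 note."""
    r_i = sqrt(ss_with_y[i] / SS_Y)
    r_j = sqrt(ss_with_y[j] / SS_Y)
    r_xx = sqrt(COMMON_SS_BETWEEN[tuple(sorted((i, j)))] / SS_Y)
    beta_i = (r_i - r_j * r_xx) / (1 - r_xx ** 2)
    beta_j = (r_j - r_i * r_xx) / (1 - r_xx ** 2)
    return beta_i * r_i + beta_j * r_j


def forward_path(ss_with_y):
    """Return the predictors entered in the first two forward steps."""
    # Step 1: the largest common SOS with Y gives the largest squared r.
    first = max(ss_with_y, key=lambda j: ss_with_y[j])
    # Step 2: conditional on the first entry, pick the best addition.
    rest = [j for j in ss_with_y if j != first]
    second = max(rest, key=lambda j: r2_pair(ss_with_y, first, j))
    return first, second


table_values = {1: 100.0, 2: 80.0, 3: 91.0, 4: 99.0}   # Table 3
perturbed = dict(table_values)
perturbed[4] = 101.0   # hypothetical: X4 gains 2 of 400 SOS units

print("as tabled :", forward_path(table_values))
print("perturbed :", forward_path(perturbed))
```

With the tabled values the entry sequence is X1 then X2; with the slightly perturbed value it is X4 then X3. A difference small enough to be sampling error replaces both selected variables, not merely the first.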

Since small differences may reflect sampling error, but these 
small differences can greatly affect the sample results, stepwise
sample results often do not generalize. Thus, Cliff (1987, pp. 
120-121) suggested that, "a large proportion of the published 
results using this method probably present conclusions that are not 
supported by the data." 
Caveats 

Obviously, less sampling error tends to be present in data 
sets involving (a) larger samples, (b) fewer predictor variables, 
and (c) larger effect sizes, as reflected in the factors involved 
in most statistical corrections for positive bias in uncorrected 
variance-accounted-for effect sizes (Snyder & Lawson, 1993;
Thompson, 1990). Thus, use of stepwise methods in these 
circumstances might be somewhat less sinful. And again, if the 
predictor variables are uncorrelated, the analysis is not distorted 
by the sampling error in the relationships among the predictors. 
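
One way to see how these three factors operate is through the familiar "adjusted R^2" correction, 1 - (1 - R^2)(n - 1)/(n - k - 1). The Python sketch below assumes this particular formula as a representative example; the cited sources discuss several corrections, and nothing here is meant to single out one as the authors' choice. The function name and the third scenario (n = 1001) are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for sample size n and k predictors.

    Assumes the common correction 1 - (1 - R^2)(n - 1)/(n - k - 1).
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Table 1 values: R^2 = .20 with n = 101.
charged_for_entered = adjusted_r2(0.20, n=101, k=5)     # 5 entered predictors
charged_for_consulted = adjusted_r2(0.20, n=101, k=50)  # all 50 candidates

# Hypothetical larger sample, to show the shrinkage fading as n grows.
large_sample = adjusted_r2(0.20, n=1001, k=5)

print(f"k=5,  n=101 : {charged_for_entered:.4f}")
print(f"k=50, n=101 : {charged_for_consulted:.4f}")
print(f"k=5,  n=1001: {large_sample:.4f}")
```

The shrinkage is modest when only the five entered predictors are counted, severe (the estimate becomes negative) when all 50 consulted predictors are counted, and nearly negligible in the larger sample.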

Summary 

Stepwise methods do not do what most researchers believe the 
methods do. Stepwise methods are especially problematic when 
statistical significance tests are invoked to determine stopping 
positions, because the methods have all the problems associated
with conventional statistical significance applications (Carver,
1978; Cohen, 1994; Thompson, 1993, 1994a, 1994b, 1994c), in spades.

As a general proposition, there are readily available software 
programs to assist with appropriate variable selection efforts by 
conducting almost instantly-available and painless
all-possible-subsets analyses. Thus, stepwise analyses should be
eschewed in favor of programs such as those offered by McCabe (1975),
the Morris program distributed within Huberty's (1994) book, or SAS
procedure RSQR. As regards interpretations involving the origins of
explained variance, i.e., variable ordering, a useful alternative 
is simply to consult standardized weights (called different names 
across analyses to confuse graduate students, e.g., beta weights, 
factor pattern coefficients, standardized discriminant function 
coefficients) and structure coefficients (Thompson & Borrello, 
1985). Huberty (1994) summarizes a variety of other helpful 
variable ordering strategies for the discriminant analysis case. 
Table 1

Hypothetical Five-Step Regression Model
With 101 Subjects and 50 Predictor Variables

Analysis  Source        SOS   df   MS      Fcalc  Fcrit  R^2
1         Explained      20    5   4.0000  4.75   4.41   20.00%
          Unexplained    80   95   0.8421
          Total         100  100
2         Explained      20   50   0.4000  0.25   ***a   20.00%
          Unexplained    80   50   1.6000
          Total         100  100

a Since Fcritical at infinite and infinite degrees of freedom equals
1, an Fcalculated less than 1 cannot be statistically significant.

step.wkl 3/22/95
Table 2

Variance Partitions of the Predictive
Abilities of the Four Predictor Variables

Single Partitions      Partitions in Combinations
Partition   SOS        Predictor   Partitions          Total
A            20        X1           E  +   F  +   G
B            50                    21  +  49  +  30      100
C            27        X2           B  +   C  +   D
D             3                    50  +  27  +   3       80
E            21        X3           A  +   B  +   E
F            49                    20  +  50  +  21       91
G            30        X4           D  +   G  +   H
H            66                     3  +  30  +  66       99


Table 3

Pairwise r Values

Variable   Common
Pair       SOS      r^2      r
X1,X2        0     .0000    .0000
X1,X3       30     .0750    .2739
X1,X4       60     .1500    .3873
X1,Y       100     .2500    .5000
X2,X3      185     .4625    .6801
X2,X4        3     .0075    .0866
X2,Y        80     .2000    .4472
X3,X4        0     .0000    .0000
X3,Y        91     .2275    .4770
X4,Y        99     .2475    .4975

Note. r^2 = common SOS / 400. For example, r^2 for X1,Y = 100/400 =
+.2500, while r for X1,Y = the square root of r^2 for X1,Y = the
square root of +.2500 = +.5000.
Table 4

Calculation of β's and R^2's for the
Six Pairwise Combinations of the Four Predictors

Predictors      r1      r2      rxx     β       β(r1)  +  β(r2)  =  R^2
1,2        1   .5000   .4472   .0000   .5000
           2   .4472   .5000   .0000   .4472   .2500  +  .2000  =  .4500
1,3        1   .5000   .4770   .2739   .3993
           3   .4770   .5000   .2739   .3676   .1997  +  .1753  =  .3750
1,4        1   .5000   .4975   .3873   .3616
           4   .4975   .5000   .3873   .3575   .1808  +  .1778  =  .3586
2,3        2   .4472   .4770   .6801   .2285
           3   .4770   .4472   .6801   .3215   .1022  +  .1534  =  .2556
2,4        2   .4472   .4975   .0866   .4072
           4   .4975   .4472   .0866   .4622   .1821  +  .2300  =  .4121
3,4        3   .4770   .4975   .0000   .4770
           4   .4975   .4770   .0000   .4975   .2275  +  .2475  =  .4750

Note. β = (r1 - (r2 * rxx)) / (1 - rxx^2). For example, for
predictor pair X1 and X3, β1 =

(.5000 - (.4770 * .2739)) / (1 - .2739^2) =
(.5000 - .1306) / (1 - .0750) =
.3694 / .9250 = .3993

R^2 = β(r1) + β(r2). For example, for predictor pair X1 and X3, R^2 =

(.3993 * .5000) + (.3676 * .4770) =
.1997 + .1753 = .3750
Figure Caption.

Figure 1

Venn Diagram of Relationships Among Five Variables